Qualcomm AI Engine Direct - Static Decoder Runner Support 16bit KV IO #13127
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13127
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures as of commit f7954d3 with merge base c8a0706.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
There seems to be a merge conflict.
Hi @cccclai, I wonder if we could have this merged? It would be great to have this in, and we can submit the PR for statistics right after.
Yeah, trying to merge, but I ran into a merge conflict; checking again.
cccclai left a comment:
Thanks for making the change!
### Summary
- Support 16-bit KV IO for the runner; it can run either 8-bit or 16-bit KV cache (see the sketch after the sample script below).
- Add a README for the script that runs Qwen2.5 0.5B.
- Improve the perplexity (PPL) score for Qwen2.5 0.5B from 18 to 12.
- Fix a BC CI bug.

Sample script:

```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext --limit 1 --artifact ./16bit_qwen_1024 --enable_masked_softmax --r3
```
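As a rough illustration of the first summary bullet, here is a minimal, hypothetical sketch (the real static decoder runner is C++; this is not its code) of KV cache IO buffers whose dtype follows the activation bit width chosen at quantization time, so one runner can serve either 8-bit or 16-bit KV IO. The function name and the layer/head/shape values are made up for illustration.

```python
# Hypothetical sketch only -- illustrates the idea that the KV IO dtype follows
# the chosen activation bit width; it is not the actual runner implementation.
import numpy as np

def make_kv_io_buffers(num_layers, num_heads, head_dim, max_seq_len, kv_bits):
    """Allocate quantized K/V cache IO buffers; 8-bit and 16-bit share one code path."""
    if kv_bits not in (8, 16):
        raise ValueError("KV IO supports 8-bit or 16-bit quantized values")
    dtype = np.uint8 if kv_bits == 8 else np.uint16
    shape = (num_layers, num_heads, max_seq_len, head_dim)
    return np.zeros(shape, dtype=dtype), np.zeros(shape, dtype=dtype)

# e.g. 16-bit KV IO, as exercised by the 16a8w config in the sample script above
k_cache, v_cache = make_kv_io_buffers(24, 14, 64, 1024, kv_bits=16)
```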
#### Stats with QNN 2.37.0 on SM8750
Accuracy: 12 PPL (aligned with the prepare_pt2e / convert_pt2e reference)

![image](https://github.com/user-attachments/assets/8fa19068-5613-4329-a527-52f3e02d408f)

Token rate: ~130 tok/sec, depending on seq_len.
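The accuracy claim above means the on-device 16a8w result matches the fake-quantized model produced by the PT2E flow. Below is a minimal, hedged sketch of that CPU reference path; the `QnnQuantizer` import path and how the 16a8w config is applied are assumptions and may differ from the actual llama.py script.

```python
# Illustrative sketch of the CPU reference path the 12-PPL figure is compared
# against; not a drop-in copy of llama.py. The QnnQuantizer configuration for
# 16a8w is assumed here and may differ across ExecuTorch versions.
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer  # assumed path

def build_ptq_reference(model, example_inputs, calibrate):
    # Assume the quantizer is configured for 16-bit activations / 8-bit weights.
    quantizer = QnnQuantizer()
    graph = torch.export.export(model, example_inputs).module()
    prepared = prepare_pt2e(graph, quantizer)
    calibrate(prepared)               # run calibration prompts through the prepared model
    reference = convert_pt2e(prepared)
    return reference                  # evaluate wikitext PPL on this to get the target score
```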
### Test plan
Added an E2E test to `test_qnn_delegate.py`.
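For context, a hedged sketch of what such an end-to-end check could look like; the actual test added to `test_qnn_delegate.py` may be structured quite differently, the device serial is a placeholder, and only flags that already appear in the sample script above are reused.

```python
# Hypothetical E2E sketch -- not the actual test added in this PR. It shells out
# to the example script with the 16a8w flags from the sample command above and
# only checks that the run completes and reports a wikitext evaluation.
import subprocess
import unittest

class TestStaticQwen16BitKV(unittest.TestCase):
    DEVICE = "<device-serial>"  # placeholder: serial of a connected SM8750 device

    def test_qwen2_5_16a8w_kv_wikitext(self):
        cmd = [
            "python", "examples/qualcomm/oss_scripts/llama/llama.py",
            "-b", "build-android", "-s", self.DEVICE, "-m", "SM8750",
            "--prompt", "What is 1+1?", "--temperature", "0",
            "--model_mode", "kv", "--max_seq_len", "1024",
            "--ptq", "16a8w", "--decoder_model", "qwen2_5",
            "--eval_perplexity", "--tasks", "wikitext", "--limit", "1",
            "--enable_masked_softmax", "--r3",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        self.assertEqual(result.returncode, 0, result.stderr)
        self.assertIn("wikitext", result.stdout.lower())

if __name__ == "__main__":
    unittest.main()
```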